Abstract

We consider the following problem that arises in outsourced storage: a user stores her data x on a 
remote server but wants to audit the server at some later point to make sure it actually did store x. The 
goal is to design a (randomized) verification protocol that has the property that if the server passes the 
verification with some reasonably high probability then the user can rest assured that the server is storing 
x. While doing this, we need to minimize the user's storage and the total amount of communication
between the server and the user. This is the data possession problem and is closely related to the problem
of obtaining a 'proof of retrievability'. Existing schemes with provable guarantees mostly use cryptographic
primitives and can only guarantee that the server is storing a constant fraction of the amount of
information that it is supposed to store. 

In this work we present an optimal solution (in terms of the user's storage and communication) 
while at the same time ensuring that a server that passes the verification protocol with any reasonable 
probability will store, to within a small additive factor, C(x) bits of information, where C(x) is the
plain Kolmogorov complexity of x. (Since we cannot prevent the server from compressing x, C(x) is a
natural upper bound.) The proof of security of our protocol combines Kolmogorov complexity with list
decoding and, unlike previous work that relies upon cryptographic assumptions, we allow the server to
have unlimited computational power. To the best of our knowledge, this is the first work that combines
Kolmogorov complexity and list decoding. 

Our framework is general enough to capture extensions where the user splits up x and stores the 
fragment across multiple servers and our verification protocol can handle non-responsive servers and 
colluding servers. As a by-product, we also get a proof of retrievability. Finally, our results also have 
an application in 'storage enforcement' schemes, which in turn have an application in trying to update a 
remote server that is potentially infected with a virus. 

Keywords: Kolmogorov Complexity, List Decoding, Data Possession, Proof of Retrievability, Reed-Solomon 
Codes, CRT codes. 
1 Introduction



We consider the following problem: A string $x \in [q]^k$ is held by the client/user, then transmitted to a remote
server (or broken up into several pieces and transmitted to several remote servers). At some later point, after 
having sent x, the client would like to know that the remote server(s) can reproduce x. At the very least 
the client would like to know that as much space was committed by the remote servers to storage as was 
used by x. The former problem has been studied under the name of proof of retrievability (cf. [16, 5, 8])
and/or data possession (cf. [1, 2, 7, 9]). The latter has been studied under the name of storage enforcing 
commitment [11]. (In these problems there is only one server.) Given the greater prevalence of outsourcing 
of storage (e.g. in "cloud computing"), such a verification procedure is crucial for auditing purposes to make 
sure that the remote servers are following the terms of their agreement with the client. 

The naive solution would be for the client to store x locally and during the verification stage, ask the 
server(s) to send back x. However, in typical applications where storage is outsourced, the main idea behind 
the client shipping x off to the servers is that it does not want to store x locally. Also, asking the server(s) 
to send back the entire string x is not desirable because of the communication cost. Thus, we want to 
design a verification protocol such that the bandwidth c used for the challenge and challenge response is low 
compared with the size of x (which we will denote by |x|) and such that the local storage m for the challenger 
should be low (far less than |x|). It is not hard to see that a deterministic protocol will not be successful. 
(The server could just store what the client stores and compute the correct answer and send it back.) Thus, 
we need a randomized verification protocol which catches "cheating" servers with high probability. In other
words, if the server(s) pass the verification with probability at least $\varepsilon$, then the server(s) must necessarily
have stored a large portion of x.

There are many other parameters for such a protocol and we highlight some of them here. First, one has 
to quantify the "largeness" of the portion of x that the server(s) are forced to store. Typically, existing results 
show that if the server(s) pass the verification process with probability $\varepsilon$, then they store (or are able to re-create)
a constant fraction of x. Second, one has to fix the assumptions on the computational power of the client 
and the servers. Typically, all of the honest parties are constrained to be polynomial time algorithms. In 
addition, one might want to minimize the query complexity of an honest server while replying back to a 
challenge (e.g. [8]). Alternatively, one might only allow for the client and server to use one pass low space 
data stream algorithms [6].^1 Third, one needs to decide on the computational power allowed to a cheating
server. Typically, the cheating server is assumed to not be able to break cryptographic primitives. Finally,
one needs to decide whether the client can make an unbounded number of audits based on the same locally
stored data. Much of the previous work based on cryptographic assumptions allows for this.

Before we move on to our result, we would like to make a point that does not seem to have been 
made explicitly before. Note that we cannot prevent a server from compressing their copy of the string, or 
otherwise representing it in any reversible way. (Indeed, the user would not care as long as the server is able 
to recreate x.) This means that a natural upper bound on the amount of storage we can force a server into is 
C(x), the (plain) Kolmogorov complexity of x, which is the size of the smallest (algorithmic) description of x.

In all of our results, if the server(s) pass the verification protocol with probability $\varepsilon > 0$, then they
provably have to store C(x) bits of data up to a very small additive factor. Cheating servers are allowed 
to use arbitrarily powerful algorithms (as long as they terminate) while responding to challenges from the 
user.^2 Further, every honest server only needs to store x (or its portion of x). In other words, unlike some
existing results, our protocol does not require the server(s) to store a massaged version of x. In practice, this
is important as we do not want our auditing protocol to interfere with "normal" read operations. However,
unlike many of the existing results based upon cryptographic primitives, our protocol can only allow a
number of audits that is proportional to the user's local storage.^3 For a more detailed comparison with
existing work, especially in the security community, see section A.

^1 [6] actually looks at a more general problem where the client wants to verify whether the server has correctly done some
computation on x. In our case, one can think of the outsourced computation as the identity function.

^2 The server(s) can use different algorithms for different strings x but the algorithm cannot change across different challenges for
the same x.

^3 An advantage of proving security under cryptographic assumptions is that one can leverage existing theorems to prove
additional security properties of the verification protocol. We do not claim any additional security guarantees other than the
ability to force servers to store close to C(x) amounts of data.

Our main result, which might need the user and honest servers to be exponential time algorithms, provably
achieves the optimal local storage for the user and the optimal total communication between the user
and the server(s). With slight worsening of the storage and communication parameters, the user and the
honest servers can work with single pass logarithmic space data stream algorithms. Finally, for the multiple
server case, we can allow for arbitrary collusion among the servers if we are only interested in checking if at
least one server is cheating (but we cannot identify it). If we want to identify at least one cheating server (and
allow for servers to not respond to challenges), then we can handle a moderately limited collusion among
servers.

Even though we quantitatively improve existing results (at least on most parameters), we believe that the
strongest point of the paper is the set of techniques that we use. In particular, our proofs naturally combine
the notions of Kolmogorov complexity and list decoding. These two notions have been used widely in
computational complexity but to the best of our knowledge ours is the first work to use both at the same time.
Next we present more details on our techniques.

Our techniques: In this part of the paper, we will mostly concentrate on the single server case.

To motivate our techniques, let us look at the somewhat related problem of designing a communication
protocol for the set/vector equality problem. In this problem Alice is given a string x and Bob is given
y and they want to check if x = y with a minimum amount of communication. (One could think of y as
the "version" of x that Bob (aka the server) stores for x.)^4 The well-known fingerprinting protocol picks a
random hash function h (using some public randomness) and Alice sends Bob $h(x)$, who checks if $h(x) = h(y)$.
Two classical hash functions $h(x)$ correspond to taking x mod a random prime (a.k.a. the Karp-Rabin
fingerprint [17]) and thinking of x as defining a polynomial $P_x(Y)$ and evaluating $P_x(Y)$ at a random field
element.

^4 The main difference is that in this communication complexity problem, Alice and Bob are both trying to help each other while
in our case Bob could be trying to defeat Alice.

It is also well-known that both of the hashes above are a special case of the following class of hash
functions. Let $H : [q]^k \to [q]^n$ be an error-correcting code with large distance. Then a random hash involves
picking a random $\beta \in [n]$ and defining $h_\beta(x) = H(x)_\beta$. (The Karp-Rabin hash corresponds to H being the
so called "Chinese Remainder Theorem" (CRT) code and the polynomial hash corresponds to H being the
Reed-Solomon code.) Our result needs H to have good list decodable properties. (We do not need an
algorithmic guarantee, as just a combinatorial guarantee suffices.)

The user picks a random $\beta$ and stores $(\beta, H(x)_\beta)$. During the verification phase, it sends $\beta$ to the server
and asks it to compute $H(x)_\beta$. If the server's answer $a \neq H(x)_\beta$ it rejects, otherwise it accepts. We now
quickly sketch why the server cannot get away with storing a vector y such that $|y|$ is smaller than $C(x)$
by some appropriately small additive factor. Since we are assuming that the server uses an algorithm to
compute its answer $\mathcal{A}(\beta, y)$ to the challenge $\beta$, if the server's answer is accepted with probability at least $\varepsilon$,
then note that the vector $(\mathcal{A}(\beta, y))_{\beta \in [n]}$ differs from $H(x)$ in at most a $1 - \varepsilon$ fraction of positions. Thus, if H
has good list decodability, then (using $\mathcal{A}$) one can compute a list $\{x_1, \dots, x_L\}$ that contains x. Finally, one
can use $\log L$ bits of advice (in addition to y) to output x. This procedure then becomes a description for x
and if $|y|$ is sufficiently smaller than $C(x)$, then our description will have size $< C(x)$, which contradicts the
definition of $C(x)$.

The trivial solution for the multiple server case would be to run independent copies of the single server 
protocol for each server. Our techniques very easily generalize to the multiple server case, which leads to
a protocol whose user storage requirement matches that of the single server case, which is better than the
trivial solution. For this generalization, we need H to be a linear code (in addition to having good list
decodability). Further, by applying a systematic code on the hashes for the trivial solution and only storing 
the parity symbols, one can also handle the case when some servers do not respond to challenges while 
using user storage that is somewhere in between that of the trivial solution and the single server protocol. 

From a more practical point of view, Reed-Solomon and CRT codes have good list decodability, which 
implies that our protocols can be implemented using the classical Karp-Rabin and polynomial hashes.
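As a rough illustration of the two hashes named above, the following Python sketch computes one symbol of a Reed-Solomon codeword (polynomial evaluation at a field element) and one symbol of a CRT codeword (reduction modulo a prime). The function names, the prime 101 and the sample inputs are assumptions made for this example only.

```python
def poly_hash(x, beta, p):
    """Polynomial hash: evaluate P_x(Y) = sum_i x[i] * Y**i at Y = beta over F_p."""
    acc = 0
    for coeff in reversed(x):          # Horner's rule, highest coefficient first
        acc = (acc * beta + coeff) % p
    return acc


def karp_rabin_hash(x_int, prime):
    """Karp-Rabin hash: view x as an integer and reduce it modulo a prime."""
    return x_int % prime


x = [3, 1, 4, 1, 5]
y = [3, 1, 4, 1, 6]   # differs from x in one symbol

# Number of evaluation points on which the two messages collide.
# Two distinct polynomials of degree < k agree on at most k - 1 points.
collisions = sum(poly_hash(x, b, 101) == poly_hash(y, b, 101) for b in range(101))
```

The small number of collisions is exactly the large-distance property of the code that the hashing view relies on.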

Another application: We believe that our results and especially our techniques should be more widely 
applicable. Next we briefly mention an application of our result to a practical problem that was pointed out 
to us by Dick Lipton. 

Assume that the user wishes to update an operating system on a remote computer, but is concerned that 
a virus on the remote machine may be listening at the network device, intercepting requests and answering 
them without actually installing the operating system, or by installing the operating system and cleverly 
reinserting the virus while doing so. 

Our single server protocol can be used, along with a randomly generated string x (which will, with
constant probability greater than 1/2, have $C(x) \geq |x| - O(1)$),^5 to force the remote machine to first store a
string of length close to $|x|$ (overwriting any virus as long as $|x|$ is chosen large enough), second, to correctly
answer a verification request (a failure to answer the request proves that the install has failed), and third
(now that there is no room for a virus) to install the new operating system.

2 Preliminaries 

We use $\mathbb{F}_q$ to denote the finite field over q elements. We also use $[n]$ to denote the set $\{1, 2, \dots, n\}$. Given any
string $x \in \mathbb{F}_q^*$, we use $|x|$ to denote the length of x in bits. Additionally, all logarithms will be base 2 unless
otherwise specified.

2.1 Verification Protocol 

We now formally define the different parameters of a verification protocol. We use U to denote the user/client. 
We assume that U wants to store its data $x \in \mathbb{F}_q^k$ among s service providers $P_1, \dots, P_s$. In the pre-processing
step of our setup, U sends x to $\{P_1, \dots, P_s\}$ by dividing it up equally among the s servers; we will
denote the chunk sent to server $i \in [s]$ as $x_i \in \mathbb{F}_q^{k/s}$.^6 Each server is then allowed to apply any computable
function to its chunk and to store a string $y_i \in \mathbb{F}_q^*$. Ideally, we would like $y_i = x_i$. However, since the servers
can compress x, we would at the very least like to force $|y_i|$ to be as close to $C(x_i)$ as possible. For notational
convenience, for any subset $T \subseteq [s]$, we denote $y_T$ ($x_T$ resp.) to be the concatenation of the strings $\{y_i\}_{i \in T}$
($\{x_i\}_{i \in T}$ resp.).

^5 The exact bound is very strong: $C(x) \geq |x| - r$ for at least a $(1 - 1/2^r)$-fraction of the strings of length $|x|$.

^6 We will assume that s divides n. Further, in our results for the case when H is a linear code, we do not need the $x_i$'s to have the
same size, only that x can be partitioned into $x_1, \dots, x_s$. We will ignore this possibility for the rest of the paper.

To enforce the conditions above, we design a protocol. We will be primarily concerned with the amount 
of storage at the client side and the amount of communication and want to minimize both simultaneously 
while giving good verification properties. The following definition captures these notions. (We also allow 
for the servers to collude among each other.) 

Definition 1. Let $s, c, m \geq 1$ and $0 \leq r \leq s$ be integers, $0 \leq \rho \leq 1$ be a real and $f : [q]^* \to \mathbb{R}_{\geq 0}$ be a function.
Then an $(s, r)$-party verification protocol with resource bound $(c, m)$ and verification guarantee $(\rho, f)$ is a
randomized protocol with the following guarantee. For any string $x \in [q]^k$, U stores at most m bits and
communicates at most c bits with the s servers. At the end, the protocol either outputs a 1 or a 0. Finally,
the following is true for any $T \subseteq [s]$ with $|T| \leq r$: If the protocol outputs a 1 with probability at least $\rho$, then
assuming that every server $i \in [s] \setminus T$ followed the protocol and that every server in T possibly colluded with
one another, we have $|y_T| \geq f(x_T)$.

We will denote a $(1, 1)$-party verification protocol as a one-party verification protocol. (Note that in this
case, the single server is allowed to behave arbitrarily.)

All of our protocols will have the following structure: we first pick a family of "keyed" hash functions. 
The protocol will pick random key(s) and store the corresponding hash values for x (along with the keys) 
during the pre-processing step. During the verification step, U sends the key(s) as challenges to the s 
servers. Throughout this paper, we will assume that each server i has a computable algorithm $\mathcal{A}_{x,i}$ such that
on challenge $\beta$ it returns an answer $\mathcal{A}_{x,i}(\beta, y_i)$ to U. The protocol then outputs 1 or 0 by applying a (simple)
boolean function on the answers and the stored hash values.

2.2 List Decodability 

We begin with some basic coding definitions. An (error-correcting) code H with dimension k and block
length $n \geq k$ over an alphabet of size q is any function $H : [q]^k \to [q]^n$. A linear code H is any error-correcting
code that is a linear function, in which case we identify $[q]$ with $\mathbb{F}_q$. A message of a code H
is any element in the domain of H. A codeword in a code H is any element in the range of H.

The Hamming distance $\Delta(x, y)$ of two same-length strings is the number of symbols in which they differ.
The relative distance $\delta$ of a code is $\min_{x \neq y} \Delta(x, y)/n$, where x and y are any two different codewords in the code.

Definition 2. A $(\rho, L)$ list-decodable code is any error-correcting code such that for every codeword e in the
code, the set E' of codewords that are Hamming distance $\rho n$ or less from e always has size L or fewer.

Geometric intuition for a $(\rho, L)$ list-decodable code is that it is one where Hamming balls of radius $\rho n$
centered at arbitrary vectors in $[q]^n$ always contain L or fewer other codewords.
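The definitions above can be checked by brute force on a very small code. The sketch below is only an illustration under assumed toy parameters (a Reed-Solomon code with q = 3, k = 2, n = 3, evaluated at the points 0, 1, 2); the helper names are hypothetical. It computes the relative distance and the largest number of codewords that any Hamming ball of a given radius contains.

```python
from itertools import combinations, product


def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def rs_codewords(q, k, n):
    """All codewords of a toy Reed-Solomon code over F_q (q prime, n <= q)."""
    def encode(msg):
        return tuple(sum(c * pow(a, i, q) for i, c in enumerate(msg)) % q
                     for a in range(n))
    return [encode(m) for m in product(range(q), repeat=k)]


def relative_distance(codewords):
    """Minimum pairwise Hamming distance divided by the block length."""
    n = len(codewords[0])
    return min(hamming(u, v) for u, v in combinations(codewords, 2)) / n


def max_list_size(codewords, q, radius):
    """Largest number of codewords in any Hamming ball of the given radius."""
    n = len(codewords[0])
    return max(sum(hamming(center, c) <= radius for c in codewords)
               for center in product(range(q), repeat=n))


code = rs_codewords(q=3, k=2, n=3)
```

The exhaustive search over all centers is exponential in n, so this is only meant to make the combinatorial definition concrete, not to suggest a decoding algorithm.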

For the purposes of this paper, we only consider codes H that are members of a family of codes $\mathcal{H}$, any
one of which can be indexed by $(k, n, q, \rho n)$. Note that not all values of the tuple $(k, n, q, \rho n)$ need represent
a code in $\mathcal{H}$, just that each $H_i \in \mathcal{H}$ has a distinct such representation.

2.3 Plain Kolmogorov Complexity 

Definition 3. The plain Kolmogorov complexity C(x) of a string x is the minimum sum of sizes of a compressed
representation of x, along with its decoding algorithm D, and a reference universal Turing machine
T that runs the decoding algorithm.

Because the reference universal Turing machine size is constant, it is useful to think of C(x) as simply
measuring the amount of inherent (i.e. incompressible) information in a string x.

Most strings cannot be compressed beyond a constant number of bits. This is seen by a simple counting 
argument. C(x) measures the extent to which this is the case for a given string. 
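The counting argument can be made concrete with a few lines of Python. This is a sketch for binary strings only, and the function name is an assumption: there are only $2^{n-r} - 1$ binary descriptions shorter than $n - r$ bits, so at most that many of the $2^n$ strings of length n can have complexity below $n - r$.

```python
from fractions import Fraction


def min_incompressible_fraction(n, r):
    """Lower bound on the fraction of n-bit strings x with C(x) >= n - r.

    Every string with C(x) < n - r needs its own description of length
    0, 1, ..., n - r - 1, and there are only 2**(n - r) - 1 such descriptions.
    Assumes 0 <= r <= n.
    """
    short_descriptions = 2 ** (n - r) - 1
    return 1 - Fraction(short_descriptions, 2 ** n)


bound = min_incompressible_fraction(10, 3)
```

For every valid n and r the returned value exceeds $1 - 2^{-r}$, which is the form of the bound used for randomly generated strings.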

3 One Remote Party Result 

We begin by presenting our main result for the case of one server (s = 1) to illustrate the combination of 
list decoding and Kolmogorov complexity. In the subsequent section, we will generalize our result to the
multiple server case. 

Theorem 1. For every computable error-correcting code $H : [q]^k \to [q]^n$ that is $(\rho, L)$ list-decodable, there
exists a one-party verification protocol with resource bound $(\log n + \log q, \log n + \log q)$ and verification
guarantee $(1 - \rho, f)$, where for any $x \in [q]^k$, $f(x) = C(x) - \log(qLn^{?}) - 2\log\log(qn) - c_0$, for some fixed
constant $c_0$.^7

Proof. We begin by specifying the protocol. In the preprocessing step, the client U does the following on 
input $x \in [q]^k$:

1. Generate a random $\beta \in [n]$.

2. Store $(\beta, \gamma = H(x)_\beta)$ and send x to the server.

The server, upon receiving x, saves a string $y \in [q]^*$. The server is allowed to use any computable
function to obtain y from x.

During the verification phase, U does the following:

1. It sends $\beta$ to the server.

2. It receives $a \in [q]$ from the server. (a is supposed to be $H(x)_\beta$.)

3. It outputs 1 (i.e. server did not "cheat") if $a = \gamma$, else it outputs a 0.

We assume that the server, upon receiving the challenge, uses a computable function $\mathcal{A}_x : [n] \times [q]^* \to [q]$
to compute $a = \mathcal{A}_x(\beta, y)$ and sends a back to U.
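The protocol just specified can be sketched in Python as follows. This is a minimal illustration, assuming H is a Reed-Solomon code over a prime field of size 257 with one codeword symbol per field element; all names and the sample data are hypothetical, and challenges are indexed from 0 rather than from 1.

```python
import random

Q = 257          # assumed prime alphabet size q
N = Q            # block length n: one codeword symbol per field element


def codeword_symbol(x, beta):
    """H(x)_beta for a Reed-Solomon code: evaluate P_x at beta over F_Q."""
    acc = 0
    for coeff in reversed(x):          # Horner's rule
        acc = (acc * beta + coeff) % Q
    return acc


def preprocess(x, rng):
    """User side: pick a random challenge and keep only (beta, gamma)."""
    beta = rng.randrange(N)
    return beta, codeword_symbol(x, beta)


def verify(state, server):
    """User side: send beta and compare the answer with the stored gamma."""
    beta, gamma = state
    return 1 if server(beta) == gamma else 0


def acceptance_probability(x, server):
    """Exact probability, over a uniform beta, that `server` passes one audit."""
    hits = sum(server(beta) == codeword_symbol(x, beta) for beta in range(N))
    return hits / N


x = [10, 20, 30, 40]
honest = lambda beta: codeword_symbol(x, beta)       # stores y = x
lazy = lambda beta: codeword_symbol(x[:-1], beta)    # dropped the last symbol
```

In this toy instance the honest server always passes, while the server that discarded one symbol agrees with the true codeword on a single challenge out of 257, so it is caught on almost every audit.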

The claim on the resource usage follows immediately from the protocol specification. Next we prove its
verification guarantee by contradiction. Assume that $|y| < f(x) = C(x) - \log(qLn^{?}) - 2\log\log(qn) - c_0$ and
yet the protocol outputs 1 with probability at least $1 - \rho$ (over the choice of $\beta$). Define $z = (\mathcal{A}_x(\beta, y))_{\beta \in [n]}$.
Note that by the claim on the probability, $\Delta(z, H(x)) \leq \rho n$. We will use this and the list decodability of the
code H to prove that there is an algorithm with description size $< C(x)$ to describe x, which is a contradiction.
To see this, consider the following algorithm that uses y and an advice string $v \in \{0, 1\}^{\log L}$:

1. Compute a description of H from $n$, $k$, $\rho n$ and $q$.

2. Compute $z = (\mathcal{A}_x(\beta, y))_{\beta \in [n]}$.

3. By cycling through all $u \in [q]^k$, retain the set $\mathcal{L} \subseteq [q]^k$ such that for every $u \in \mathcal{L}$, $\Delta(H(u), z) \leq \rho n$.

4. Output the vth string from $\mathcal{L}$.

Note that since H is $(\rho, L)$-list decodable, there exists an advice string v such that the algorithm above
outputs x. Further, since H is computable, there is an algorithm $\mathcal{E}$ that can compute a description of H
from $n$, $k$, $\rho n$ and $q$. (Note that using this description, we can generate any codeword $H(u)$ in step 3.) Thus,
we have a description of x of size $|y| + |v| + |\mathcal{A}_x| + |\mathcal{E}| + (3\log n + \log q + 2\log\log n + 2\log\log q + 2)$ (where
the last term is for encoding the different parameters^8), which means if $|y| < C(x) - |v| - |\mathcal{A}_x| - |\mathcal{E}| -
(3\log n + \log q + 2\log\log n + 2\log\log q + 2) = f(x)$, then we have a description of x of size $< C(x)$, which
is a contradiction. □

^7 The contribution from the encoding of the constants in this theorem to $c_0$ is 2. For most codes, we can take $c_0$ to be less than
a few thousand. What is important is that the contribution from the encoding is independent of the rest of the constants in the
theorem. Although we do not explicitly say so in the body of the theorem statements, this important fact is true for the rest of the
results in this paper as well.
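The decoding algorithm used in the proof of Theorem 1 can be run on a toy instance. The following sketch assumes a Reed-Solomon code with q = 5, k = 2, n = 5 and models the server's algorithm as a lookup table of its answers; the names and parameters are illustrative assumptions, and the exhaustive search mirrors the "cycling through all messages" step rather than any efficient list decoder.

```python
from itertools import product

Q, K, N = 5, 2, 5   # assumed toy parameters: alphabet size, dimension, block length


def encode(u):
    """Toy Reed-Solomon encoder: evaluate u[0] + u[1]*a at a = 0..N-1 over F_Q."""
    return tuple(sum(c * pow(a, i, Q) for i, c in enumerate(u)) % Q
                 for a in range(N))


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def candidate_list(answer, radius):
    """Steps 2 and 3: tabulate the answers z, keep every message within `radius`."""
    z = tuple(answer(beta) for beta in range(N))
    return [u for u in product(range(Q), repeat=K)
            if hamming(encode(u), z) <= radius]


def recover(answer, radius, advice):
    """Step 4: the advice selects one entry of the candidate list."""
    return candidate_list(answer, radius)[advice]


x = (1, 2)                     # the stored message; encode(x) == (1, 3, 0, 2, 4)
table = (1, 3, 0, 0, 0)        # a server that is right on 3 of the 5 challenges
answer = lambda beta: table[beta]
```

Here the answer table is within distance 2 of two codewords, so the candidate list has two entries and one bit of advice (the index 1) is enough to single out x, which is the role the advice string plays in the description-size count.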

The one unsatisfactory aspect of the result above is that if H is not polynomial time computable, then 
Step 2 in the pre-processing step for U is not efficient. Similarly, if the server is not cheating (and e.g. stores
y = x), then it cannot also compute the correct answer efficiently. We will come back to these issues when 
we instantiate H by an explicit code such as Reed-Solomon. 

Remark 1. If H has relative distance $\delta$, then note that if our protocol has verification guarantee
$(1 - \delta/2 - \varepsilon, C(x) - \log(qn^{?}) - 2\log\log(qn) - c_0)$ for some fixed constant $c_0$, then y has enough information for the
server to compute x back from it. (It can use the same algorithm to compute x from y detailed above, except
it does not need the advice string v, as in Step 3, we will have $\mathcal{L} = \{x\}$.) For the more general case when H is
$(\rho, L)$-list decodable and our protocol has verification guarantee $(1 - \rho, C(x) - \log(qLn^{?}) - 2\log\log(qn) - c_0)$,
then y has enough information for the server to compute a list $\mathcal{L} \supseteq \{x\}$ with $|\mathcal{L}| \leq L$. The client, if given
access to $\mathcal{L}$, can use its local hash to pick x out of $\mathcal{L}$ with probability at least $1 - \delta L$.

4 Multiple Remote Party Result 

In the first two sub-sections, we will implicitly assume the following: (i) We are primarily interested in
whether some server was cheating and not in identifying the cheater(s) and (ii) We assume that all servers 
always reply back (possibly with an incorrect answer). 

4.1 Trivial Solution 

We begin with the following direct generalization of Theorem 1 to the multiple server case: essentially run 
s independent copies of the protocol from Theorem 1.

^8 We use a simple self-delimiting encoding of q and n, followed immediately by k and $\rho n$ in binary, with the remaining bits used
for v. A simple self-delimiting encoding for a positive integer u is the concatenation of: ($\lceil \log(|u|) \rceil$ in unary, 0, $|u|$ in binary, u in
binary). We omit the description of this encoding in later proofs.

Theorem 2. For every computable error-correcting code $H : [q]^{k/s} \to [q]^n$ that is $(\rho, L)$ list-decodable,
there exists an $(s, s)$-party verification protocol with resource bound $(\log n + s\log q, s(\log n + \log q))$ and
verification guarantee $(1 - \rho, f)$, where for any $x \in [q]^k$, $f(x) = C(x) - s - \log(s^{?} q L^s n^{?}) - 2\log\log(qn) - c_0$,
for some fixed positive integer $c_0$.

Proof. We begin by specifying the protocol. In the pre-processing step, the client U does the following on
input $x \in [q]^k$:

1. Generate a random $\beta \in [n]$.

2. Store $(\beta, \gamma_1 = H(x_1)_\beta, \dots, \gamma_s = H(x_s)_\beta)$ and send $x_i$ to the server i for every $i \in [s]$.

Server i on receiving $x_i$ saves a string $y_i \in [q]^*$. The server is allowed to use any computable function to
obtain $y_i$ from $x_i$.

During the verification phase, U does the following:

1. It sends $\beta$ to all s servers.

2. It receives $a_i \in [q]$ from server i for every $i \in [s]$. ($a_i$ is supposed to be $H(x_i)_\beta$.)

3. It outputs 1 (i.e. none of the servers "cheated") if $a_i = \gamma_i$ for every $i \in [s]$, else it outputs a 0.

Similar to the one-party result, we assume that server i, on receiving the challenge, uses a computable
function $\mathcal{A}_{x,i} : [n] \times [q]^* \to [q]$ to compute $a_i = \mathcal{A}_{x,i}(\beta, y_i)$ and sends $a_i$ back to U.

The claim on the resource usage follows immediately from the protocol specification. Next we prove
its verification guarantee. Let $T \subseteq [s]$ be the set of colluding servers. We will prove that $y_T$ is large by
contradiction: if not, then using the list decodability of H, we will present a description of $x_T$ of size
$< C(x_T)$. Consider the following algorithm that uses $y_T$ and an advice string $v \in (\{0, 1\}^{\log L})^{|T|}$, which is the
concatenation of shorter strings $v_j \in \{0, 1\}^{\log L}$ for each $j \in T$:

1. Compute a description of H from $n$, $k$, $\rho n$, $q$ and $s$.

2. For every $j \in T$, compute $z_j = (\mathcal{A}_{x,j}(\beta, y_j))_{\beta \in [n]}$.

3. Do the following for every $j \in T$: by cycling through all $u \in [q]^{k/s}$, retain the set $\mathcal{L}_j \subseteq [q]^{k/s}$ such that
for every $u \in \mathcal{L}_j$, $\Delta(H(u), z_j) \leq \rho n$.

4. For each $j \in T$, let $w_j$ be the $v_j$th string from $\mathcal{L}_j$.

5. Output the concatenation of $\{w_j\}_{j \in T}$.

Note that since H is $(\rho, L)$-list decodable, there exists an advice string v such that the algorithm above
outputs $x_T$. Further, since H is computable, there is an algorithm $\mathcal{E}$ that can compute a description of H
from $n$, $k$, $\rho n$, $q$ and $s$. (Note that using this description, we can generate any codeword $H(u)$ in step 3.)
Thus, we have a description of $x_T$ of size $|y_T| + |v| + \sum_{j \in T} |\mathcal{A}_{x,j}| + |\mathcal{E}| + (s + \log(s^{?} q L^s n^{?}) + 2\log\log(qn) + 3)$
(where the term in parentheses is for encoding the different parameters and T), which means that if $|y_T| <
C(x_T) - |v| - \sum_{j \in T} |\mathcal{A}_{x,j}| - |\mathcal{E}| - (s + \log(s^{?} q L^s n^{?}) + 2\log\log(qn) + 3) = f(x)$, then we have a description
of $x_T$ of size $< C(x_T)$, which is a contradiction. □

4.2 Multiple Parties, One Hash 

One somewhat unsatisfactory aspect of Theorem 2 is that the storage needed by U goes up by a factor of s from
that in Theorem 1. Next we show that if the code H is linear (and list decodable) then we can get a similar
guarantee as that of Theorem 2 except that the storage usage of U remains the same as that in Theorem 1.

Theorem 3. For every computable linear error-correcting code $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$ that is $(\rho, L)$ list-decodable,
there exists an $(s, s)$-party verification protocol with resource bound $(\log n + \log q, s(\log n + \log q))$ and
verification guarantee $(1 - \rho, f)$, where for any $x \in \mathbb{F}_q^k$, $f(x) = C(x) - s - \log(s^{?} q L n^{?}) - 2\log\log(qn) - c_0$, for
some fixed positive integer $c_0$.
Proof. We begin by specifying the protocol. In the pre-processing step, the client U does the following on
input $x \in [q]^k$:

1. Generate a random $\beta \in [n]$.

2. Store $(\beta, \gamma = H(x)_\beta)$ and send $x_i$ to the server i for every $i \in [s]$.

Server i on receiving $x_i$ saves a string $y_i \in [q]^*$. The server is allowed to use any computable function
to obtain $y_i$ from $x_i$. For notational convenience, we will use $\hat{x}_i$ to denote the string $x_i$ extended to a string in
$\mathbb{F}_q^k$ by adding zeros in positions that correspond to servers other than i.

During the verification phase, U does the following:

1. It sends $\beta$ to all s servers.

2. It receives $a_i \in [q]$ from server i for every $i \in [s]$. ($a_i$ is supposed to be $H(\hat{x}_i)_\beta$.)

3. It outputs 1 (i.e. none of the servers "cheated") if $\gamma = \sum_{i=1}^{s} a_i$, else it outputs a 0.

We assume that server i on receiving the challenge uses a computable function $\mathcal{A}_{x,i} : [n] \times [q]^* \to [q]$ to
compute $a_i = \mathcal{A}_{x,i}(\beta, y_i)$ and sends $a_i$ back to U.
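The single stored hash works because a linear code turns the sum of the zero-padded chunks into the sum of their codeword symbols. The sketch below illustrates this with a Reed-Solomon code over an assumed prime field of size 101 and three honest servers; the helper names and the data are hypothetical.

```python
Q = 101   # assumed prime field size


def codeword_symbol(x, beta):
    """H(x)_beta for a Reed-Solomon code: evaluate P_x at beta over F_Q."""
    acc = 0
    for coeff in reversed(x):
        acc = (acc * beta + coeff) % Q
    return acc


def pad(chunk, i, s):
    """Extend chunk i to full length, with zeros in the other servers' positions."""
    width = len(chunk)
    full = [0] * (width * s)
    full[i * width:(i + 1) * width] = chunk
    return full


def verify_sum(gamma, answers):
    """User side: accept iff the answers add up (in F_Q) to the single stored hash."""
    return 1 if sum(answers) % Q == gamma else 0


x = [1, 2, 3, 4, 5, 6]
s = 3
chunks = [x[0:2], x[2:4], x[4:6]]
beta = 7                                   # the challenge, fixed here for illustration
gamma = codeword_symbol(x, beta)           # the only hash value U stores
answers = [codeword_symbol(pad(chunks[i], i, s), beta) for i in range(s)]
```

With honest answers `verify_sum(gamma, answers)` accepts, and changing any single answer makes the sum disagree with the stored value, although the user cannot tell which server was responsible.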

The claim on the resource usage follows immediately from the protocol specification. Next we prove
its verification guarantee. Let $T \subseteq [s]$ be the set of colluding servers. We will prove that $y_T$ is large by
contradiction: if not, then using the list decodability of H, we will present a description of $x_T$ of size
$< C(x_T)$.

For notational convenience, define $\hat{x}_T = \sum_{j \in T} \hat{x}_j$ and $\hat{x}_{\overline{T}} = \sum_{j \notin T} \hat{x}_j$. Consider the following algorithm
that uses $y_T$ and an advice string $v \in \{0, 1\}^{\log L}$:

1. Compute a description of H from $n$, $k$, $\rho$, $q$, $s$ and $L$.

2. Compute $z = (\sum_{j \in T} \mathcal{A}_{x,j}(\beta, y_j))_{\beta \in [n]}$.

3. By cycling through all $u \in \mathbb{F}_q^k$, retain the set $\mathcal{L} \subseteq \mathbb{F}_q^k$ such that for every $u \in \mathcal{L}$, $\Delta(H(u), z) \leq \rho n$.

4. Output the vth string from $\mathcal{L}$.

To see the correctness of the algorithm above, note that for every $j \in [s] \setminus T$, $(\mathcal{A}_{x,j}(\beta, y_j))_{\beta \in [n]} = H(\hat{x}_j)$.
Thus, if the protocol outputs 1 with probability at least $1 - \rho$, then $\Delta(z, H(\hat{x}_T)) \leq \rho n$; here we used the
linearity of H to note that $H(\hat{x}_T) = H(x) - H(\hat{x}_{\overline{T}})$. Note that since H is $(\rho, L)$-list decodable, there exists an
advice string v such that the algorithm above outputs $\hat{x}_T$ (from which we can easily compute $x_T$). Further,
since H is computable, there is an algorithm $\mathcal{E}$ that can compute a description of H from $s$, $n$, $k$, $\rho n$ and $q$.
Thus, we have a description of $x_T$ of size $|y_T| + |v| + \sum_{j \in T} |\mathcal{A}_{x,j}| + |\mathcal{E}| + (s + \log(s^{?} q L n^{?}) + 2\log\log(qn) + 3)$
(where the term in parentheses is for encoding the different parameters and T), which means that if $|y_T| <
C(x_T) - |v| - \sum_{j \in T} |\mathcal{A}_{x,j}| - |\mathcal{E}| - (s + \log(s^{?} q L n^{?}) + 2\log\log(qn) + 3) = f(x)$, then we have a description of $x_T$ of
size $< C(x_T)$, which is a contradiction. □
4.3 Catching the Cheaters and Handling Unresponsive Servers

We now observe that since the protocol in Theorem 2 checks each answer $a_i$ individually to see if it is the
same as $\gamma_i$, it can easily handle the case when some server does not reply back at all. Additionally, if the
protocol outputs a 0 then it knows that at least one of the servers in the colluding set is cheating. (It does not
necessarily identify the exact set T.^9)

However, the protocol in Theorem 3 cannot identify the cheater(s) and needs all the servers to always 
reply back. Next, using Reed-Solomon codes, at the cost of higher user storage and a stricter bound on the 
number of colluding servers, we show how to get rid of these shortcomings. 

Recall that a Reed-Solomon code $RS : \mathbb{F}_q^m \to \mathbb{F}_q^\ell$ can be represented as a systematic code (i.e. the first m
symbols in any codeword are exactly the corresponding message) and can correct r errors and e erasures as
long as $2r + e \leq \ell - m$. Further, one can correct from r errors and e erasures in $O(\ell^{?})$ time. The main idea
in the following result is to follow the protocol of Theorem 2 but instead of storing all the s hashes, U only
stores the parity symbols in the corresponding Reed-Solomon codeword.

Theorem 4. For every computable linear error-correcting code $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$ that is $(\rho, L)$ list-decodable, assuming at most $e$ servers will not reply back to a challenge, there exists an $(r,s)$-party verification protocol with resource bound $(\log n + (2r+e)\cdot\log q,\ s(\log n + \log q))$ and verification guarantee $(1-\rho, f)$, where
for any x G F^, /(x) = C(x) — s — log{s^qLn^) — 21oglog(g?i) — co, for some fixed positive integer cq. 

Proof. We begin by specifying the protocol. As in the proof of Theorem 3, define $\hat{x}_i$, for $i \in [s]$, to be the string $x_i$ extended to the vector in $\mathbb{F}_q^k$ that has zeros in the positions that do not belong to server $i$. Further, for any subset $T \subseteq [s]$, define $x_T = \sum_{i \in T} \hat{x}_i$. Finally, let $RS : \mathbb{F}_q^s \to \mathbb{F}_q^\ell$ be a systematic Reed-Solomon code where $\ell = 2r + e + s$.

In the pre-processing step, the client $U$ does the following on input $x \in [q]^k$:

1. Generate a random $\beta \in [n]$.

2. Compute the vector $v = (H(\hat{x}_1)_\beta, \ldots, H(\hat{x}_s)_\beta) \in \mathbb{F}_q^s$.

3. Store $(\beta, \gamma_1 = RS(v)_{s+1}, \ldots, \gamma_{2r+e} = RS(v)_\ell)$ and send $x_i$ to server $i$ for every $i \in [s]$.

Server $i$, on receiving $x_i$, saves a string $y_i \in [q]^*$. The server is allowed to use any computable function to obtain $y_i$ from $x_i$.

During the verification phase, $U$ does the following:

1. It sends $\beta$ to all $s$ servers.

2. For each server $i \in [s]$, it either receives no response or receives $a_i \in \mathbb{F}_q$. ($a_i$ is supposed to be $H(\hat{x}_i)_\beta$.)

3. It computes the received word $z \in \mathbb{F}_q^\ell$, where for $i \in [s]$, $z_i = {?}$ (i.e., an erasure) if the $i$th server does not respond, else $z_i = a_i$, and for $s < i \le \ell$, $z_i = \gamma_{i-s}$.

4. Run the decoding algorithm for $RS$ to compute the set $T' \subseteq [s]$ of error locations. (Note that by Step 2, $U$ already knows the set $E$ of erasures.)

^9 We assume that identifying at least one server in the colluding set is motivation enough for servers not to collude.

We assume that server $i$, on receiving the challenge, uses a computable function $\mathcal{A}_{x,i} : [n] \times [q]^* \to [q]$ to compute $a_i = \mathcal{A}_{x,i}(\beta, y_i)$ and sends $a_i$ back to $U$ (unless it decides not to respond).
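
The pre-processing and verification steps above can be sketched as follows. This is an illustration under stated assumptions rather than the protocol's actual implementation: arithmetic is over a prime field $\mathbb{F}_p$ with evaluation points $0, \ldots, \ell-1$, the hash values $H(\hat{x}_i)_\beta$ are passed in directly as the list `v`, and the decoder is a brute-force search over suspect sets of size at most $r$ instead of an efficient Reed-Solomon decoder. The names `interpolate_at`, `parity_symbols` and `locate_cheaters` are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Point = Tuple[int, int]


def interpolate_at(points: Sequence[Point], x0: int, p: int) -> int:
    """Value at x0 of the unique polynomial of degree < len(points) through `points`, mod prime p."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x0 - xm) % p
                den = den * (xj - xm) % p
        total = (total + yj * num * pow(den, -1, p)) % p  # modular inverse needs Python 3.8+
    return total


def parity_symbols(v: Sequence[int], r: int, e: int, p: int) -> List[int]:
    """Systematic RS encoding: positions 0..s-1 carry v, positions s..ell-1 are the parity the user keeps."""
    s = len(v)
    ell = 2 * r + e + s
    assert ell <= p, "need ell distinct evaluation points in F_p"
    message_points = list(enumerate(v))
    return [interpolate_at(message_points, x, p) for x in range(s, ell)]


def locate_cheaters(answers: Sequence[Optional[int]], parity: Sequence[int],
                    r: int, p: int) -> Optional[List[int]]:
    """Return the error locations T' among responding servers, or None if no set of size <= r fits.

    answers[i] is server i's reply, or None for an erasure (no response).
    Parity symbols are stored by the user and therefore trusted.
    """
    s = len(answers)
    trusted = [(s + i, g) for i, g in enumerate(parity)]
    responders = [i for i, a in enumerate(answers) if a is not None]
    for size in range(r + 1):                      # smallest suspect set first
        for suspect in combinations(responders, size):
            kept = [(i, answers[i]) for i in responders if i not in suspect] + trusted
            if len(kept) < s:
                continue
            base = kept[:s]                        # s points fix a polynomial of degree < s
            if all(interpolate_at(base, x, p) == y for x, y in kept[s:]):
                return list(suspect)
    return None


if __name__ == "__main__":
    p = 101
    v = [17, 42, 8, 99, 23]                        # stand-ins for the s = 5 hash values
    parity = parity_symbols(v, r=1, e=1, p=p)      # the user stores only these 2r + e symbols
    answers = [17, 43, 8, None, 23]                # server 1 cheats, server 3 is silent
    print(locate_cheaters(answers, parity, r=1, p=p))
```

In this sketch the user keeps $2r+e$ field elements instead of all $s$ hashes, and the search returns exactly the cheating responders whenever at most $r$ answers are wrong and at most $e$ servers are silent.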

The claim on the resource usage follows immediately from the protocol specification. We now prove the verification guarantee. Let $T$ be the set of colluding servers. We will prove that with probability at least $1-\rho$, $U$ using the protocol above computes $\emptyset \ne T' \subseteq T$ (and $|y_T|$ is large enough). Fix a $\beta \in [n]$. If for this $\beta$, $U$ obtains $T' = \emptyset$, then this implies that for every $i \in [s]$ such that server $i$ responds, we have $a_i = H(\hat{x}_i)_\beta$. This is because, by our choice of $RS$, the decoding in Step 4 will return $v$ (which in turn allows us to compute exactly the set $T' \subseteq T$ such that for every $j \in T'$, $a_j \ne H(\hat{x}_j)_\beta$).^10 Thus, if the protocol outputs a $T' \ne \emptyset$ with probability at least $1-\rho$ over the random choices of $\beta$, then using the same argument as in the proof of Theorem 3, we note that $\Delta(H(x_T), (\sum_{j \in T} \mathcal{A}_{x,j}(\beta, y_j))_{\beta \in [n]}) \le \rho n$. Again, using the same argument as in
the proof of Theorem 3 this implies that {yj \ ^ C{xt) — s — log{s'^qLn^) — 21og log(^?i) — cq, for some fixed 
positive integer cq. □ 

5 Corollaries 

We now present specific instantiations of list decodable codes $H$ to obtain corollaries of our main results.

5.1 Optimal Storage Enforcement

We begin with the following observation: If the reply from a server comes from a domain of size $q$, then one cannot hope to have a verification protocol with verification guarantee $(\delta, f)$ for any $\delta \le 1/q$ for any non-trivial $f$. This is because the server can always return a random value and, no matter what function $U$ uses to compute its final output, the server will always get a favorable response with probability $1/q$.^11

Next, we show that we can get $\delta$ to be arbitrarily close to $1/q$ while still obtaining $f(x)$ to be very close to $C(x)$. We start off with the following result due to Zyablov and Pinsker:

Theorem 5 ([26]). Let $q \ge 2$ and let $0 < \rho < 1 - 1/q$. There exists a $(\rho, L)$-list decodable code with rate $1 - H_q(\rho) - 1/L$.

It is known that for $\varepsilon < 1/q$, $H_q(1 - 1/q - \varepsilon) \le 1 - C_q\varepsilon^2$, where $C_q = q/(4\ln q)$ [20, Chap. 2]. This implies that there exists a code $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$, with $n \le 8k\ln q/(q\varepsilon^2)$, which is $(1 - 1/q - \varepsilon, 1/\varepsilon^2)$-list decodable. Note that the above implies that one can deterministically compute a uniquely-determined such code by iterating over all possible codes with dimension $k$ and block length $n$ and outputting the lexicographically least one that is $(1 - 1/q - \varepsilon, L)$-list decodable with the smallest discovered value of $L$. Applying this to Theorem 2 implies the following result:

Corollary 6. For every $\varepsilon < 1/q$ and integer $s \ge 1$, there exists an $(s,s)$-party verification protocol with
resource bound {\ogk + {s— l)log^ — 21oge-f |loglog^ + 3),5(log^ — 21og£+ |loglogg + 3)) and verifi- 
cation guarantee {\/q + z,f), where for any x £ [qY, f{x) = C(x) — i' — logi^^"* +log£^*+^ — loglog^^^^ + 
log log £^ — log log log — Co for some fixed positive integer cq. 
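
The deterministic search described before Corollary 6 can be illustrated on toy parameters. The sketch below makes several assumptions that are not in the text: the list size $L$ is fixed in advance rather than minimized, the radius is given as an absolute number of errors (that is, $\rho n$), a code is represented as a tuple of codewords indexed by message, and the function names are hypothetical. The search is doubly exponential, so it only illustrates computability.

```python
from itertools import product
from typing import Optional, Tuple

Word = Tuple[int, ...]


def hamming(a: Word, b: Word) -> int:
    """Number of positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))


def is_list_decodable(code: Tuple[Word, ...], q: int, n: int, radius: int, L: int) -> bool:
    """True if every Hamming ball of the given radius holds at most L codewords (one per message)."""
    for center in product(range(q), repeat=n):
        if sum(hamming(c, center) <= radius for c in code) > L:
            return False
    return True


def first_list_decodable_code(q: int, k: int, n: int, radius: int, L: int) -> Optional[Tuple[Word, ...]]:
    """Lexicographically first map [q]^k -> [q]^n that is (radius, L)-list decodable, if any."""
    words = list(product(range(q), repeat=n))
    for code in product(words, repeat=q ** k):   # enumerated in lexicographic order
        if is_list_decodable(code, q, n, radius, L):
            return code
    return None


if __name__ == "__main__":
    # Binary code with two messages and block length 3 that corrects one error uniquely.
    print(first_list_decodable_code(q=2, k=1, n=3, radius=1, L=1))
```

Because the enumeration order is fixed, every party running the same search obtains the same code, which is the property the argument relies on.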

Some of our results need $H$ to be linear. To this end, we will need the following result due to Guruswami et al.^12

^10 We will assume that $T \cap E = \emptyset$. If not, just replace $T$ by $T \setminus E$.

^11 For the single server case, the server can take $y$ to be the empty string. For the multiple server case where $T \subseteq [s]$ is the colluding set of servers, the colluding servers can, for example, ensure that they answer correctly for all but one server $i \in T$ and not store anything for server $i$.

^12 The corresponding result for general codes has been known for more than thirty years.

Theorem 7 ([15]). Let $q \ge 2$ be a prime power and let $0 < \rho < 1 - 1/q$. Then a random linear code of rate $1 - H_q(\rho) - \varepsilon$ is $(\rho, C_{\rho,q}/\varepsilon)$-list decodable for some term $C_{\rho,q}$ that just depends on $\rho$ and $q$.

As a corollary, the above implies (along with the arguments used earlier in this section) that there exists a linear code $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$ with $n \le 8k\ln q/(q\varepsilon^2)$ that is $(1 - 1/q - \varepsilon, C'_q/\varepsilon^2)$-list decodable (where $C'_q = C_{1-1/q-\varepsilon,q}$). Applying this to Theorem 4 gives us the following:

Corollary 8. For every $\varepsilon < 1/q$, integer $s \ge 1$, and $r, e \le s$, assuming at most $e$ servers do not reply back
to a challenge, there exists an {r,s)-party verification protocol with resource bound {logkq^''^'^^^ — log£^ + 
loglogg'^/^ + 3,5(log^ — log£^ + loglog^^/^ + 3)) and verification guarantee {l/q + E,f), where for any 
X G [q]'^, f{x) = C{x) — s — logi'^Cg qk^ + log^^£^° — \o%\ogq^k^ + loglog£^ — logloglog^^ — cq for some 
fixed positive integer cq. 

5.2 Practical Storage Enforcement 

All of our results so far have used computable codes H, which are not that useful in practice. What we 
really want in practice is to use codes H that lead to an efficient implementation of the protocol. At the very 
least, all the honest parties in the verification protocol should not have to use more than polynomial time to 
perform the required computation. An even more desirable property would be for honest parties to be able 
to do their computation in a one pass, logspace, data stream fashion. In this section, we'll see one example 
of each. Further, it turns out that the resulting hash functions are classical ones that are also used in practice. 

5.2.1 Johnson Bound 

Before we instantiate $H$ with specific codes, we first state a general combinatorial result for list decoding codes with large distance, which will be useful in our subsequent corollaries. The result below allows for a sort of non-standard definition of codes, where a codeword is a vector in $\prod_{i=1}^{n}[q_i]$, where the $q_i$'s can be distinct.^13 (So far we have looked only at the case where $q_i = q$ for $i \in [n]$.) The notion of Hamming distance still remains the same, i.e., the number of positions in which two vectors differ. (The syntactic definitions of the distance of a code and the list decodability of a code remain the same.) We will need the following result:

Theorem 9 ([14]). Let $C$ be a code with block length $n$ and distance $d$ where the $i$th symbol in a codeword comes from $[q_i]$. Then the code is $\left(1 - \sqrt{1 - d/n},\ 2\sum_{i=1}^{n} q_i\right)$-list decodable.

5.2.2 Hashing Modulo a Random Prime 

We will begin with a code that corresponds to the classical Karp-Rabin hash [17]. Let $H$ be the so-called Chinese Remainder Theorem (or CRT) code. In particular, we will consider the following special case of such codes. Let $p_1 \le p_2 \le \cdots \le p_n$ be the first $n$ primes. Consider the CRT code $H : \prod_{i=1}^{k}[p_i] \to \prod_{i=1}^{n}[p_i]$, where the message $x \in \{0, 1, \ldots, (\prod_{i=1}^{k} p_i) - 1\}$ is mapped to the vector $(x \bmod p_1, x \bmod p_2, \ldots, x \bmod p_n) \in \prod_{i=1}^{n}[p_i]$. It is known that such codes have distance $n - k + 1$ (cf. [14]). By a simple upper bound on the prime counting function (cf. [3]), we can take $p_n \le 2n\log n$. Moreover, $\sum_{i=1}^{n} p_i < np_n/2$ (cf. [19]). Thus, if we pick a CRT code with $n = k/\varepsilon^2$, then by Theorem 9, $H$ is $(1 - \varepsilon, 2k^2(\log k - \log\varepsilon^2)/\varepsilon^4)$-list decodable.

^13 We're overloading the product operator $\prod$ here to mean the iterated Cartesian product.

Further, note that given any $x \in \{0, 1, \ldots, (\prod_{i=1}^{k} p_i) - 1\}$ and a random $\beta \in [n]$, $H(x)_\beta$ corresponds to the Karp-Rabin fingerprint (modding the input integer with a random prime). Further, $H(x)_\beta$ can be computed in polynomial time.
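
A small sketch of the CRT code and its fingerprint may help make the encoding concrete. The helper names `first_primes`, `crt_encode` and `fingerprint` are hypothetical, primes are found by naive trial division, and $\beta$ is treated as a 1-based index into the list of primes; these choices are assumptions made for illustration only.

```python
from typing import List


def first_primes(n: int) -> List[int]:
    """The first n primes, by trial division against the primes found so far."""
    primes: List[int] = []
    cand = 2
    while len(primes) < n:
        if all(cand % p for p in primes if p * p <= cand):
            primes.append(cand)
        cand += 1
    return primes


def crt_encode(x: int, k: int, n: int) -> List[int]:
    """CRT codeword of x: its residues modulo each of the first n primes.

    The message must be smaller than the product of the first k primes.
    """
    primes = first_primes(n)
    bound = 1
    for p in primes[:k]:
        bound *= p
    if not 0 <= x < bound:
        raise ValueError("message out of range for dimension k")
    return [x % p for p in primes]


def fingerprint(x: int, beta: int, n: int) -> int:
    """H(x)_beta: the residue of x modulo the beta-th prime (beta is 1-based, 1 <= beta <= n)."""
    return x % first_primes(n)[beta - 1]


if __name__ == "__main__":
    print(crt_encode(5, k=2, n=5))
    print(fingerprint(5, beta=4, n=5))
```

The user only needs `fingerprint` for a single random $\beta$; the full codeword from `crt_encode` is shown only to make the distance property visible on small inputs.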

Thus, letting H be the CRT code in Theorem 2, we get the following: 

Corollary 10. For every $\varepsilon > 0$, there exists an $(s,s)$-party verification protocol with resource bound

$((s+1)\log k + s - (s+1)\log\varepsilon^2 + s\log\log(k/\varepsilon^2),\ s\log k - s\log\varepsilon^2 + s + s\log\log(k/\varepsilon^2))$

with verification guarantee $(\varepsilon, f)$, where for every $x \in \{0, 1, \ldots, \prod_{i=1}^{k} p_i - 1\}$,

$f(x) = C(x) - c_0(s(\log(k/\varepsilon) - \log\log(k/\varepsilon) - 1)) - c_1\log\log\log(k/\varepsilon) - c_2$

for some fixed positive integers $c_0$, $c_1$ and $c_2$. Further, all honest parties can do their computation in $\mathrm{poly}(n)$ time.

Remark 2. Theorem 4 can be extended to handle the case where the symbols in codewords of H are of 
different sizes. However, for the sake of clarity we refrain from applying CRT to the generalization of 
Theorem 4. Further, the results in the next subsection allow for a more efficient implementation of the 
computation required from the honest parties. 

5.2.3 Reed-Solomon Codes 

Finally, we take $H : \mathbb{F}_q^k \to \mathbb{F}_q^n$ to be the Reed-Solomon code, with $n = q$. Recall that for such a code, given message $x = (x_0, \ldots, x_{k-1}) \in \mathbb{F}_q^k$, the codeword is given by $H(x) = (P_x(\beta))_{\beta \in \mathbb{F}_q}$, where $P_x(Y) = \sum_{i=0}^{k-1} x_i Y^i$. It is well-known that such a code $H$ has distance $n - k + 1$. Thus, if we pick $n = k/\varepsilon^2$, then by Theorem 9, $H$ is $(1 - \varepsilon, 2k^2/\varepsilon^4)$-list decodable.
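
The parameters quoted above can be checked with a short calculation. It assumes that the Johnson-type bound of Theorem 9 gives decoding radius $1 - \sqrt{1 - d/n}$ and list size $2\sum_{i} q_i$; this form is an assumption inferred from the parameters stated in this section. With $d = n - k + 1$, $q_i = q = n$ and $n = k/\varepsilon^2$:

```latex
\begin{align*}
1 - \sqrt{1 - \frac{d}{n}}
  &= 1 - \sqrt{\frac{k-1}{n}}
  \;>\; 1 - \sqrt{\frac{k}{n}}
  \;=\; 1 - \varepsilon, \\
2\sum_{i=1}^{n} q_i
  &= 2nq \;=\; 2n^2 \;=\; \frac{2k^2}{\varepsilon^4}.
\end{align*}
```

So a code that is list decodable up to radius $1 - \sqrt{1 - d/n}$ is in particular $(1 - \varepsilon, 2k^2/\varepsilon^4)$-list decodable under this assumption.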

Further, note that given any $x \in \mathbb{F}_q^k$ and a random $\beta \in [n]$, $H(x)_\beta$ corresponds to the widely used "polynomial" hash. Further, $H(x)_\beta$ can be computed in one pass over $x$ with storage of only a constant number of $\mathbb{F}_q$ elements. (Further, after reading each entry in $x$, the algorithm just needs to perform one addition and one multiplication over $\mathbb{F}_q$.)
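
The one-pass computation can be sketched with Horner's rule. The sketch assumes that $q$ is prime (so that arithmetic in $\mathbb{F}_q$ is arithmetic modulo $q$) and that the stream delivers the coefficients from $x_{k-1}$ down to $x_0$; the function name `poly_hash_stream` is hypothetical.

```python
from typing import Iterable


def poly_hash_stream(symbols_high_to_low: Iterable[int], beta: int, q: int) -> int:
    """Evaluate P_x(beta) = sum_i x_i * beta^i over F_q (q prime) in one pass.

    Coefficients must arrive from x_{k-1} down to x_0. Only one accumulator
    is kept, and each symbol costs one multiplication and one addition mod q.
    """
    acc = 0
    for x_i in symbols_high_to_low:
        acc = (acc * beta + x_i) % q
    return acc


if __name__ == "__main__":
    x = [3, 1, 4, 1, 5]                      # x_0, ..., x_4
    print(poly_hash_stream(reversed(x), beta=7, q=101))
```

If the stream instead arrives in the order $x_0, \ldots, x_{k-1}$, the same loop evaluates the polynomial with reversed coefficients, which is again a Reed-Solomon encoding of a reordered message.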

Thus, applying H as the Reed-Solomon code to Theorems 2 and 4 implies the following: 

Corollary 11. For every $\varepsilon > 0$,

(i) There exists an $(s,s)$-party verification protocol with resource bound $((s+1)(2\log k + 4\log(1/\varepsilon) + 1),\ 2s(2\log k + 4\log(1/\varepsilon) + 1))$ and verification guarantee $(\varepsilon, f)$, where for any $x \in \mathbb{F}_q^k$, $f(x) = C(x) - O(s(\log k + \log(1/\varepsilon)))$.

(ii) Assuming at most $e$ servers do not respond to challenges, there exists an $(r,s)$-party verification protocol with resource bound $((2r+e+1)(2\log k + 4\log(1/\varepsilon) + 1),\ 2s(2\log k + 4\log(1/\varepsilon) + 1))$ and verification guarantee $(\varepsilon, f)$, where for any $x \in \mathbb{F}_q^k$, $f(x) = C(x) - O(s + \log k + \log(1/\varepsilon))$.

Further, in both the protocols, honest parties can implement their required computation with a one pass, $O(\log k + \log(1/\varepsilon))$ space (in bits) and $\tilde{O}(\log k + \log(1/\varepsilon))$ update time data stream algorithm.

A Tables

Existing approaches for data possession verification at remote storage can be broadly classified into two categories: crypto-based and coding-based. Crypto-based approaches rely on symmetric and asymmetric cryptographic primitives for proof of data possession. Ateniese et al. [1] defined the proof of data possession (PDP) model, which uses public key homomorphic tags for verification of stored files. It can also support public verifiability with a slight modification of the original protocol by adding extra communication cost.

Table 1: Summary of the existing schemes compared to the proposed scheme, AIS. Method denotes the technique primarily used for data possession verification by a scheme. The row Proof of Storage Enforcement refers to whether a scheme forces a server to store as much as the original data. The capability of a trusted third party to verify possession of data is the Public Verifiability metric. Retrievability denotes the capability to retrieve the original data from the responses of the server. Finally, Corruption Detection refers to whether the verification is for complete data or partial data. When complete data is verified, any modification of data will be detected, whereas in partial verification, a small fraction of corruption can remain undetected. The Unlimited Audit row refers to whether the scheme is for bounded or unbounded usage. The term NEC refers to No Explicit Construction. Please note that there are different variations of POR and PDP in the existing literature. For the sake of brevity, we only consider the original proposals in this comparison table, whereas other variations are discussed in the related works section.



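Metric/Scheme                | SFC [21] | POR [8]  | SDS [24] | HAIL [4] | DDP [10] | PDP [1]  | EPV [25] | SEC [12] | AIS
-----------------------------+----------+----------+----------+----------+----------+----------+----------+----------+---------
Method                       | Code     | Code     | Code     | Code     | Crypto   | Crypto   | Crypto   | Crypto   | Code
Proof of Storage Enforcement | NEC      | NEC      | NEC      | NEC      | NEC      | NEC      | NEC      | Yes      | Yes
Public Verifiability         | Yes      | NEC      | Yes      | Yes      | Yes      | Yes      | Yes      | Yes      | Yes
Retrievability               | NEC      | Yes      | NEC      | NEC      | NEC      | NEC      | Yes      | NEC      | Yes
Corruption Detection         | Complete | Partial  | Partial  | Partial  | Complete | Partial  | Partial  | Partial  | Complete
Unlimited Audit              | Yes      | No       | No       | Yes      | Yes      | Yes      | Yes      | Yes      | No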



In subsequent work, Ateniese et al. [2] proposed a symmetric crypto-based variation (SEP), which is computationally efficient compared to the original PDP but lacks public verifiability. Also, both of these protocols considered the scenario with files stored on a single server, and do not discuss erasure tolerance. However, Curtmola et al. [7] extended PDP to a multiple-server scenario by introducing multiple identical replicas of the original data. Among other notable constructions of PDP, Gazzoni et al. [10] proposed a scheme (DDP) that relied on an RSA-based hash (exponentiating the whole file), and Shah et al. [23] proposed a symmetric encryption based storage audit protocol. Recent extensions of crypto-based PDP schemes by Wang et al. (EPV) [25] and Erway et al. [9] mainly focus on supporting data dynamics in addition to existing capabilities. Golle et al. [12] proposed a cryptographic primitive called storage enforcing commitment (SEC), which probabilistically guarantees that the server is using storage whose size is equal to the size of the original data to correctly answer the data possession queries. In general, the drawbacks of the aforementioned protocols are: (a) they are computation intensive due to the usage of expensive cryptographic primitives, and (b) since each verification checks a random fragment of the data, a small fraction of data corruption might go undetected, and hence they do not guarantee the retrievability of the original data.

Coding-based approaches, on the other hand, have relied on special properties of linear codes such as the Reed-Solomon (RS) [18] code. The key insight is that encoding the data imposes certain algebraic constraints on it, which can be used to devise an efficient fingerprinting scheme for data verification. Earlier schemes proposed by Schwarz et al. (SFC) [21] and Goodson et al. [13] are based on this idea, are primarily focused on the construction of fingerprinting functions, and fall under the category of distributed protocols for file integrity checking. Later, Juels and Kaliski [16] proposed a construction of a proof of retrievability (POR), which guarantees that if the server passes the verification of data possession, the original data is retrievable with high probability. While the scheme by Juels and Kaliski [16] supported a limited number of verifications, the theoretical POR construction by Shacham and Waters [22] extended it to unlimited verification and public verifiability by integrating cryptographic primitives. Subsequently, Dodis et al. [8] provided theoretical studies on different variants of existing POR schemes, and Bowers et al. [5] considered POR protocols of practical interest [16, 22] and showed how to tune parameters to achieve different performance goals. However, these POR schemes only consider the single server scenario and have no construction of a retrievability guarantee in a distributed storage scenario. Very recently, protocols developed by Wang et al. (SDS) [24] and Bowers et al. (HAIL) [4] focus on securing distributed cloud storage in terms of availability and integrity. Table 1 summarizes and compares existing schemes with AIS. The asymptotic complexity of different operations is compared in Table 2.



Table 2: Asymptotic performance comparison of existing schemes with the proposed scheme. We assume that the data contains n symbols, each of size log q bits, and that it is divided into s equal-sized blocks, which are distributed among the servers. We compare the token generation and verification for all s blocks. ξ is the fraction of symbols checked during each verification for the schemes that check for partial data corruption. For cryptographic schemes, we assume E_m is the cost of performing modular exponentiation modulo m. Token generation/verification complexity is based on the number of bit operations. Storage and communication complexity is based on the number of bits. AIS-S refers to our proposed simple scheme, where we generate one token for each server, and AIS-E is a variation where a single token can verify multiple servers. Additional server storage refers to the amount of data that the server stores in addition to the original data, if any.

Operation/Scheme         | SFC [21]         | POR [16]         | SDS [24]         | HAIL [4]         | DDP [10]         | PDP [1]          | EPV [25]         | SEC [12]         | AIS-S            | AIS-E
-------------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------
Token Generation         | O(n log q)       | O((nξ) log q)    | O(n log q)       | O(n log n log q) | O(n E_m)         | O(n E_m)         | O(n E_m)         | O(n E_m)         | O(n log q)       | O((n/s) log q)
Proof Generation         | O(n log q)       | O((nξ) log q)    | O((nξ) log q)    | O((nξ) log q)    | O(n E_m)         | -                | O((nξ) E_m)      | -                | O(n log q)       | O(n log q)
Proof Verification       | O(n log n log q) | O(1)             | O(n log n log q) | O(n log n log q) | O(n E_m)         | -                | O((nξ) E_m)      | O((nξ) E_m)      | O(1)             | O(1)
Client Storage           | O(1)             | O(s log q)       | O(s log q)       | O(1)             | O(1)             | O(1)             | O(s log m)       | O(s log m)       | O(s log q)       | O(1)
Add. Server Storage      | -                | -                | -                | O(n log q)       | -                | O(s log m)       | O(1)             | O(s log m)       | -                | -
Communication Complexity | O(s log q)       | O((nξ) log q)    | O((nξ) log q)    | O((nξ) log q)    | O(s log m)       | O(s log m)       | O(s log m)       | O(s log m)       | O(s log q)       | O(s log q)